- Title
- An integrated, fast and scalable approach for large-scale biological network analysis
- Creator
- Arefin, Ahmed Shamsul
- Relation
- University of Newcastle Research Higher Degree Thesis
- Resource Type
- thesis
- Date
- 2013
- Description
- Research Doctorate - Computer Science
- Description
- The amount of data in our world has been exploding. Computer-based methods used to analyze data ten years ago are impractical today, as continuously evolving data acquisition technologies produce more raw data than these methods can handle. For instance, today’s high-throughput technologies such as DNA microarrays can produce millions of data elements from a single experiment, whereas most of the relevant analysis tools are designed to work with only a few tens of thousands. Even though the scalability of these methods and tools may be improved by porting the implementations to an expensive supercomputer or a cluster of computers, their fully connected data representation model can still pose many other restrictions. In this work, instead of using the traditional distance-matrix-based microarray data analysis model, we propose a novel, fast and scalable κ-Nearest Neighbor (κNN) graph-based approach. Moreover, instead of constructing the graph on an expensive system, we demonstrate its construction on graphics processing units (GPUs), which are now widely available as inexpensive, highly parallel devices. The output of our κNN graph construction method (termed GPU-FS-κNN) can be used to carry out many other important computational tasks. In particular, we demonstrate its application in two popular data analysis methods: clustering and centrality analysis. To do this, we first propose a fast GPU-based method for constructing minimum spanning trees (MSTs) from the κNN graphs (termed κNN-Borůvka) and a method for partitioning the trees in an agglomerative fashion (termed κNN-Borůvka-Agglomerative). Then, we demonstrate the use of κNN graphs in accelerating and scaling the computation of two degree-based (degree and eigenvector) and three shortest-path-based (closeness, eccentricity and betweenness) centrality metrics.
Finally, we integrate the developed methods and apply them together to two publicly available gene-expression data sets (Alzheimer’s disease and breast cancer) and their large-scale artificial expansions. Our investigations show that the proposed integrated approach can find both numerically and biologically significant results. We also demonstrate the method’s application in extracting a robust set of gene markers that may warrant further investigation, owing to their conspicuous positions in our results.
- Subject
- data clustering; centrality analysis; GPU-based computation; microarray-based data analysis
- Identifier
- http://hdl.handle.net/1959.13/938499
- Identifier
- uon:12629
- Rights
- Copyright 2013 Ahmed Shamsul Arefin
- Language
- eng
- Full Text
File | Description | Size | Format
---|---|---|---
ATTACHMENT01 | Abstract | 186 KB | Adobe Acrobat PDF
ATTACHMENT02 | Thesis | 13 MB | Adobe Acrobat PDF